This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
The idea here is to use exploratory data analysis in order to be able to show faulty borrowers in a better light, and be able to distinguish some or their most remarkable characteristics apart from those who do succeed to keep up on their payments, and finish.
By this data, does employment status at the time of listing have any visible correlation with loan status?
It is important to mention that the dataset distinguishes between borrowers that are ‘Employed’, ‘Full time’ and ‘Part time’; this leads one to believe that this data maybe faulty since one category really is a set made up by the other two.
Here, it is easy to notice that the greater bulk of borrowers are made up of those who were employed at the time of loan listing. this group is made up of ‘Employed’, ‘Full-time’, ‘Self-employed’ in that order. We can also see that there are too many field names to fit on this small graph frame. we can use the theme function to tilt these labels, and the fill aesthetic for coloring, like so:
We notice here that there is an extra column in our histogram with no label. so we use the unique function to create a vector, named “c”, which will contain all the unique labels of our rows.
You can see that there is a graph column with no name (or “”). to get rid of this we save a subset of our initial data set to a variable, without these rows with values.
Now, let’s Graph a few other variables
## EmploymentStatus
## Employed :67322
## Full-time :24872
## Self-employed: 6029
## Not available: 2959
## Other : 2474
## Part-time : 1002
## (Other) : 1563
In our dataset our employed borrowers take the lead as the biggest group of borrowers of all emplyment status groups. The interesting part is that the “Employed” status should be logically be composed of subcategories, like “Full-time”, “Other”, “Part-time” or “Self-employed”; instead it is accounted for as a whole different state.
grid.arrange(ggplot(aes(x=LoanOriginalAmount),
data = csv_file.reassesedEmp)+
geom_histogram(bins = 30),
ggplot(aes(x=1, y=LoanOriginalAmount),
data = csv_file.reassesedEmp)+
geom_boxplot(), nrow=1)
summary(csv_file.reassesedEmp["LoanOriginalAmount"])
## LoanOriginalAmount
## Min. : 1000
## 1st Qu.: 4000
## Median : 7000
## Mean : 8527
## 3rd Qu.:12000
## Max. :35000
It is interesting to see the outliers in this distribution. It makes one wonder as to why these $150,000 loans are so popular in the data. Aside from the outliers a right tailed destribution can be perceived, where the majority of loan values sit at the lower dollar-value end.
qplot(CreditScoreRangeLower, bins = 35, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
summary(csv_file.reassesedEmp["CreditScoreRangeLower"])
## CreditScoreRangeLower
## Min. : 0.0
## 1st Qu.:660.0
## Median :680.0
## Mean :689.5
## 3rd Qu.:720.0
## Max. :880.0
Here we get a notion of the worst recorded credit health of the borrowers. It is nice to see that the bulk of the scores sit in between 625~750, which can be considered fair to excellent.
qplot(CreditScoreRangeUpper, bins = 35, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
summary(csv_file.reassesedEmp["CreditScoreRangeUpper"])
## CreditScoreRangeUpper
## Min. : 19.0
## 1st Qu.:679.0
## Median :699.0
## Mean :708.5
## 3rd Qu.:739.0
## Max. :899.0
Once again we notice the bulk-majority of our scores at the fair-to-excellent range, this time peeking towards the higher scores It’s nice to see such a healthy consistency among loaners’ scores.
qplot(CreditScoreRange, bins = 35, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
summary(csv_file.reassesedEmp["CreditScoreRange"])
## CreditScoreRange
## Min. :19
## 1st Qu.:19
## Median :19
## Mean :19
## 3rd Qu.:19
## Max. :19
This is a true oddity, how is it possible tha all of our borrowers credit scores have only varied by 19 points the entire lifetime before withdrawing a loan? Is this a mistake? if so, was it machine-made or man-made? was it intentional? (After the 2008 financial crisis, official investigations uncovered evidence of banks coaching or guiding soon-to-be borrower to answer their questionaires to be more likely to be approved)
qplot(BorrowerState, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
summary(csv_file.reassesedEmp["BorrowerState"])
## BorrowerState
## CA :14406
## TX : 6617
## NY : 6591
## FL : 6583
## IL : 5840
## GA : 4853
## (Other):61331
It seems as if more populars states to live in received the greatest amounts of loans. This notion may make sense since the more people that live in a state, the more people to ask for loans there are, with California taking the lead on this one by over twice the amount of loans of the next state in line, Texas.
qplot(Occupation, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
summary(csv_file.reassesedEmp["Occupation"])
## Occupation
## Other :27487
## Professional :13122
## Computer Programmer : 4243
## Executive : 4189
## Teacher : 3649
## Administrative Assistant: 3547
## (Other) :49984
Since there ae a lot more professions other than the ones listed here, it may be understandable that most loans were given to borrowers with “Other” occupations. The next category, “Professional”, may be a subjective title, which once again brings me to the idea that there is a presence of fraud in this data. Now, in third place we programmers. Not surprising since we are currently living in the information era.
qplot(LoanStatus, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
summary(csv_file.reassesedEmp["LoanStatus"])
## LoanStatus
## Current :55272
## Completed :33768
## Chargedoff :10833
## Defaulted : 4091
## Past Due (1-15 days) : 798
## Past Due (31-60 days): 358
## (Other) : 1101
It is encouraging to see that the majority of loans are Completed or current. However, the presence of fraud is still looming over all of the current loans, the amount of loans defaulted and charged-off.
The fill aesthetic can be used to add color difference among levels of each factor variable on their respective graphs, but it wasn’t used here for the sake of simplicity, it will be used later on.
My dataset is made up of 26 of the 81 total variables of the original dataset. I cut them short in the interest of time and memory space. Although more data is always better it seemed justifiable to cut this dataset to investigate a few variables at first since the discoveries in relations among so many variables could very well take a great amount of time to understand.
My main features of interests are LoanStatus and EmploymentStatus, in order to find the best signs of borrowers defaulting on their payments, since logic follows that you will not have money to pay a loan if you do not have a job, which is most people’s main source of income. Other features of interest may be: the type of job(Occupation) a borrower has, since different careers can make different amounts of money.Loan amount (LoanOriginalAmount) since it is harder to pay a bigger loan than it is a small one. Also, credit score, since it is a score system based on previous information of a borrower’s likely of paying previous debt.
I created a new variable called CreditScoreRange which contains the range between the maximum credit score, and minimum credit score. I believe that there is a negative correlation between a borrower’s likelyhood to pay off their loan and the magnitude of this number.
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
As we move through these variables and graph their count per level, we noticed that certain categories in each graph are more likely to posess a higher count than others. For example, the employment graph demonstrates that most loans are given to those who were employed at the time the borrowers asked for the loan. The CreditScoreRangeLowre and CreditScoreRangeUpper categories show that most loan pertain to those borrower with credit scores roughly in the range of 500 to 750. the BorrowerState shows that California is the state with most borrowed loans. In the occupation variable graph we see that there is a high quantity of borrowers who wrote down ‘Other’ as their occupation, and the most popular loan Statuses are ‘Chargedoff’, ‘Completed’ and ‘Current’, as demonstrated in the LoanStatus graph.
We can also use the fill aesthetic in order to distinguish among employment statuses in the count vs LoanStatus graph:
## Warning: Ignoring unknown parameters: binwidth, bins, pad
At a quick glance, it is noticble how most of these loan-status categories are ruled by a greater majority of employed borrowers, except by “Completed” and “Chargedoff” categories, which is somehow a contradictory.
Let’s take a look at those groups that have not comply with their payment deadlines. These include: “Defaulted” and all variations of “Past Due”.
## Employed Full-time Not available Not employed
## 0 675 28 0 9
## Other Part-time Retired Self-employed
## 38 2 2 44
Once more, due to the ambiguity of the description “Employed” we cannot rely on this information completely. “Employed” should be a column of its own containing True and False arguments, with the other categories in this data column being in a different one. However, assuming that “Employed” encapsules other values, such as:
“Full-time” “Part-time” “Self-employed”
then it is surprising that they dominate the over this loan status categories, given that such categories imply an income with which a loan is payed.
ggplot(aes(x=LoanStatus,
fill=factor(EmploymentStatus)),
data = subset(csv_file.reassesedEmp,
LoanStatus=='Past Due (16-30 days)'))+
geom_bar(position='dodge')+
scale_y_log10()
summary(subset(csv_file.reassesedEmp$EmploymentStatus,
csv_file.reassesedEmp$LoanStatus ==
"Past Due (16-30 days)"))
## Employed Full-time Not available Not employed
## 0 226 7 0 5
## Other Part-time Retired Self-employed
## 12 1 0 13
The same arguments as the above chart’s are valid here, since distribution of emplyment status accross loan status are eerily similar; with the exception of full-time employed borrowers being relatively smaller in this instance.
ggplot(aes(x=LoanStatus,
fill=factor(EmploymentStatus)),
data = subset(csv_file.reassesedEmp,
LoanStatus=='Past Due (31-60 days)'))+
geom_bar(position='dodge')+
scale_y_log10()
summary(subset(csv_file.reassesedEmp$EmploymentStatus,
csv_file.reassesedEmp$LoanStatus ==
"Past Due (31-60 days)"))
## Employed Full-time Not available Not employed
## 0 293 13 0 3
## Other Part-time Retired Self-employed
## 23 2 2 22
Once more, the pattern repeats, the relative sizes of employed borrower categories do not differ much as we ascend in the length of time delay in loan payments. This again keeps the presence of fraud afloat in this data. Why would borrowers who have jobs be late on their payments?
The next couple of graphs are too similar to these past ones therefore there is not a lot of new information we could extract from them:
ggplot(aes(x=LoanStatus,
fill=factor(EmploymentStatus)),
data = subset(csv_file.reassesedEmp,
LoanStatus=='Past Due (61-90 days)'))+
geom_bar(position='dodge')+
scale_y_log10()
summary(subset(csv_file.reassesedEmp$EmploymentStatus,
csv_file.reassesedEmp$LoanStatus ==
"Past Due (61-90 days)"))
## Employed Full-time Not available Not employed
## 0 262 11 0 6
## Other Part-time Retired Self-employed
## 13 1 1 19
ggplot(aes(x=LoanStatus,
fill=factor(EmploymentStatus)),
data = subset(csv_file.reassesedEmp,
LoanStatus=='Past Due (91-120 days)'))+
geom_bar(position='dodge')+
scale_y_log10()
summary(subset(csv_file.reassesedEmp$EmploymentStatus,
csv_file.reassesedEmp$LoanStatus ==
"Past Due (91-120 days)"))
## Employed Full-time Not available Not employed
## 0 255 12 0 3
## Other Part-time Retired Self-employed
## 18 2 1 13
ggplot(aes(x=LoanStatus,
fill=factor(EmploymentStatus)),
data = subset(csv_file.reassesedEmp,
LoanStatus=='Past Due (>120 days)'))+
geom_bar(position='dodge')+
scale_y_log10()
summary(subset(csv_file.reassesedEmp$EmploymentStatus,
csv_file.reassesedEmp$LoanStatus ==
"Past Due (>120 days)"))
## Employed Full-time Not available Not employed
## 0 14 0 0 0
## Other Part-time Retired Self-employed
## 0 0 0 2
Ultimately, it is surprising to see that these two categories of borrower are here. Even thou they are small in quantity compared to previous loan status categories, it is still concerning that they are present at all.
ggplot(aes(x=LoanStatus,
fill=factor(EmploymentStatus)),
data = subset(csv_file.reassesedEmp,
LoanStatus=='Defaulted'))+
geom_bar(position='dodge')+
scale_y_log10()
summary(subset(csv_file.reassesedEmp$EmploymentStatus,
csv_file.reassesedEmp$LoanStatus ==
"Defaulted"))
## Employed Full-time Not available Not employed
## 0 630 2050 985 25
## Other Part-time Retired Self-employed
## 54 56 54 237
We can see that Employed borrowers make up the majority of all “Past Due” bars. Whereas “Full-time” and “Not Available” make up the majority of the “Defaulted” category.
Let’s use the same strategy to contrast more than one supporting variable at a time. Here we did not use all of the variables that we graphed previously since some of these variable are continuos, integer data.
a <-sort(table(csv_file.reassesedEmp$BorrowerState),decreasing=TRUE)[1:10]
These are the top ten states with the most amount of loans.
## AK AL AR AZ CA CO CT DC DE FL GA
## 0 0 0 0 0 14406 0 0 0 0 6583 4853
## HI IA ID IL IN KS KY LA MA MD ME MI
## 0 0 0 5840 0 0 0 0 0 0 0 3477
## MN MO MS MT NC ND NE NH NJ NM NV NY
## 0 0 0 0 0 0 0 0 3053 0 0 6591
## OH OK OR PA RI SC SD TN TX UT VA VT
## 4139 0 0 0 0 0 0 0 6617 0 3233 0
## WA WI WV WY
## 0 0 0 0
Although differen in size, the spread of the top 10 borrowing state seems to be consistentamong loan statuses, i.e. the states seem to have relative equal varieties of loan statuses as each other.
Bivariate analysis a very simple way to find the spread of categories, like those of “EmploymentStatus” accross the body of another variable’s magnitude, LoanStatus vs Count.
We could also employ ggplot’s facet_wrap in order to make different graphs of any categorical variable vs Count accross all levels of “LoanStatus”
I personally really enjoy using facet wrap since it is both an efficient and visually complex graphing tool. Here, we better compare different loan statuses side by side in terms of different employment stauses.
qplot(LoanOriginalAmount,
fill = LoanStatus,
bins = 35, data = csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
scale_y_log10()+
facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 175 rows containing missing values (geom_bar).
ggplot(aes(x=LoanStatus,
y=LoanOriginalAmount,
fill=LoanStatus),
data=csv_file.reassesedEmp)+
scale_y_log10()+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))+
geom_violin()
Here we examine the spread of instances of each loan amount amongst all possible loan status. We can see that the Status such as “Chargedoff”, “Completed” and “Defaulted” have a greater variance in loan amounts. Whereas the others, except for “Cancelled”, hang at the top of the spectrum with greater original amount values.
qplot(CreditScoreRangeLower, fill = LoanStatus, bins = 35, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
scale_y_log10()+
facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 283 rows containing missing values (geom_bar).
In this graph there is more consistency across one of our continuous variables, leading to better shaped distributions of instances of our loans. In other words, our number of loans accross any loan status is more symmetrically distributed along our “CreditScoreRangeLower” variable then it was by “LoanAmount”. However, it is interesting to see that in the category graphs “Current” and “Completed” the bulk of loan count amounts, in terms of the lower range of borrower credit scores, peeks on the lower credit score side. In these two categorie, being on the lower side made someone more likely to get a loan, according to this info.
qplot(CreditScoreRangeUpper, fill = LoanStatus, bins = 35, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
scale_y_log10()+
facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 289 rows containing missing values (geom_bar).
As in the graph above, we see the same borrower pattern. We see that more loans were granted at the lower end of the “CreditScoreRangeUpper” range. Which again shows a relation with between more loans as credit score lowers. Let’s keep this in mind as we look for more signs that may point towards fraud as a possible variable.
qplot(CreditScoreRange, fill = LoanStatus, bins = 35, data =
csv_file.reassesedEmp)+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))+
scale_y_log10()+
facet_wrap(~LoanStatus)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 408 rows containing missing values (geom_bar).
This section really grabs should grab the readers attaention since we are finding that all possible borrowers either had no difference in value of their credit score (possibly indicating that these borrower, in their entire credit history have never been able to improve nor worsen their credit score) or that all other borrowers have all only affected their credit score by, exclusively +/- 19 points. These two sceneraios should be unlikely since credit scores are not static, iven a long enough amount of time, and definitely not change only by a constant value of 19.
Here we get a more discrete representation of each supporting variables’ description against the LoanStatus. It is easier to distinguish each column of count per supporting variable’s level against the grouping by LoanStatus’ levels than it was before. The fill aesthetic argument was introduced to make the bars more distinguishable.
By simultaneously contrasting two supporting variables we can see their relationship with one another. When we analyzed individual columns of the LoanStatus vs EmploymentStatus we noticed that the main body of all “Past Due” variations are made up of employed borrower, and the “Defaulted” category is made up mainly of borrower who marke their employment status as “Full-time” and “Not Available”.
Yes, in the brainstorm section at the end of this report I visualized the relations between variables ‘CreditScoreRangeLower vs BorrowerState’ and ‘CreditScoreRangeUpper vs BorrowerState’. The observation I made was the similiarites in variance of each of these credit ranges of each state. That is to say that the variance in upper and lower credit scores were almost the same, per state.
Fast forward to the part we use ggplot and facet_wrap, we can see here that is easier to spot any spike in count of loans according to their loan status, but also contrasted against any of the other variables, and their respective levels. For example, in Count vs EmploymentStatus facet_wrapped around Loanstatus we can see that there are major spike for employed and fulltime borrowers under the categories of chargedoff, completed and current. Another example are the spikes in the Count vs Ocupation vs LoanStatus graph; these spikes represent the high count of borrowers under the “Other” Occupation status, in the “Chargedoff”, “Current” and “Completed” categories of LoanStatus.
OVerall, these Bivariate graphing methods have served to better interpret the hidden relationships between variables. By classifying Data points further the addition of more discriminating factors we have brought a better understanding in our EDA rough draft, which may lead one to believe that consecutive addition of other variables will help understand this data set in new better ways.
Since our subjects of interest are those which are past due on their payments, or have defaulted on their loans then we will go ahead and limit our data set to borrowers who fall under these categories.
csv_file.reassesedLS <-
csv_file.reassesedEmp[which(
csv_file.reassesedEmp$LoanStatus == 'Past Due (1-15 days)'|
csv_file.reassesedEmp$LoanStatus =='Past Due (16-30 days)'|
csv_file.reassesedEmp$LoanStatus =='Past Due (31-60 days)'|
csv_file.reassesedEmp$LoanStatus =='Past Due (61-90 days)'|
csv_file.reassesedEmp$LoanStatus =='Past Due (91-120 days)'|
csv_file.reassesedEmp$LoanStatus =='Past Due (>120 days)'|
csv_file.reassesedEmp$LoanStatus =='Defaulted'),]
We should notice the count reduction of borrowers from 106221 to 6144.
Let’s make sure that only those levels of LoanStatus we are interested in have trickle down into our new dataset
unique(csv_file.reassesedLS$LoanStatus)
## [1] Past Due (1-15 days) Defaulted Past Due (16-30 days)
## [4] Past Due (61-90 days) Past Due (31-60 days) Past Due (91-120 days)
## [7] Past Due (>120 days)
## 12 Levels: Cancelled Chargedoff Completed Current ... Past Due (91-120 days)
Before we had notice that borrowers with a ‘Employed’ employment status made up most of the mass of past due loans (including all of its variations in past due time), as opposed to other employment statuses. We will also look at the body of employed borrowers with ‘Defaulted’ loans. Now let’s take this information and contrast it against other variables.
Let’s reassess our data set again to narrow it down to those employed borrowers that fall under this Loan statuses.
csv_file.reassesedLSandEmp <- subset(csv_file.reassesedLS,
csv_file.reassesedLS$EmploymentStatus ==
'Employed')
Since we have narrowed our dataset further down into loan listings of employed borrowers whose loan status is not favorable we will now investigate the relations of this new set to the other variables. Here we will make a boxplot of our loan amounts. We do so in order to look at the variability of loan amounts according to any given categorical variable that fall under ‘Defaulted’ and all ‘Past Due’ variations of status.Remember, these are all employed borrowers.
Let’s reorganize our variables for a better visualization of our data:
csv_file.drops <- csv_file.reassesedLSandEmp[,c("LoanStatus",
"IncomeRange",
"IsBorrowerHomeowner",
"IncomeVerifiable",
"EmploymentStatus",
"FirstRecordedCreditLine",
"EmploymentStatusDuration",
"OpenCreditLines",
"DelinquenciesLast7Years",
"CreditScoreRange",
"CreditScoreRangeLower",
"CreditScoreRangeUpper",
"TotalCreditLinespast7years",
"TotalTrades",
"CurrentDelinquencies",
"TradesOpenedLast6Months",
"StatedMonthlyIncome",
"CurrentCreditLines",
"AmountDelinquent",
"DebtToIncomeRatio",
"LoanOriginalAmount")]
In this case, ggduo can narrow down some work, this is because it can graph one variable of our dataset against all others simultanously. Here, we are comparing “LoanStatus” against the rest of the variables in separate chunks for better visualization.
drops <- c('Term',
'ListingKey',
'Occupation',
'BorrowerState',
'IsBorrowerHomeowner',
'EmploymentStatus',
'IncomeRange',
'IncomeVerifiable')
csv_file.drops <- csv_file.drops[,!(names(csv_file.drops) %in% drops)]
ncol(csv_file.drops)
## [1] 17
ggduo(csv_file.drops, 1, 2:17)
## Warning: Removed 85 rows containing non-finite values (stat_boxplot).
This graph is too dense to be able to see it clearly. Let’s divide this group of graphs into multiple for a better look at the graphed data.
ggduo(csv_file.drops, 1, 2)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$FirstRecordedCreditLine)
## Min. 1st Qu. Median Mean 3rd Qu.
## "1958-01-01" "1991-04-27" "1996-08-06" "1995-05-29" "2000-05-24"
## Max.
## "2011-04-19"
There doesn’t seem to be a lot of relative variation in “FirstRecordedCreditLine” across time, in terms of these loan statuses; with the exception of “Past Due (>120 days)” (We previously noticed that this category is a very small subset of our data). Given this, we can better the similarities between faulty loan status through in terms of this new variable.
ggduo(csv_file.drops, 1, 3)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$EmploymentStatusDuration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 28.00 69.00 94.99 135.50 527.00
“EmploymentStatusDuration” works as another aspect employed faulty-borrowers of differnt loan status categories have in common, with minor variations.
ggduo(csv_file.drops, 1, 4)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$OpenCreditLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 8.000 9.008 12.000 33.000
We should make a remark about this graph on how the majority of employed borrowers, per loan status category is centers towards the lower side of the “OpenCreditLine” domain, with a few instances of outliers possesing more than 20 open credit lines at the time of loan acceptance.
ggduo(csv_file.drops, 1, 5)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$DelinquenciesLast7Years)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.409 3.000 89.000
The amount of delinquencies outliers per Loan status (aside from “Past Due (>120 days)”) is remarkable in this graph, as oppososed to previous ones. Specially since delinquencies sound like they could be an important factor to take into account when considering accepting a borrower’s loan request.
ggduo(csv_file.drops, 1, 7)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$CreditScoreRange)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19 19 19 19 19 19
We previously noted this odd coincidence. This an important fact to keep in mind for later when the in our multivariate analysis and reflection
ggduo(csv_file.drops, 1, 8)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$CreditScoreRangeLower)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 680.0 689.8 720.0 860.0
“CreditScoreRangeLower” seems to be one of our most consistant variables to graph. It seems as employed borrowers belonging to the different, faulty loan status categories share very similar low-credit-score ranges.
ggduo(csv_file.drops, 1, 9)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$CreditScoreRangeUpper)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 619.0 679.0 699.0 708.8 739.0 879.0
Since we know that the variation in credit score ranges per borrower can only vary by +/-19 point or none, it only makes sense that this graph is the same as previous where most values were just increased by 19 points
ggduo(csv_file.drops, 1, 10)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$TotalCreditLinespast7years)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 18.00 26.00 27.72 35.00 97.00
“TotalCreditLinespast7years” is another variable that really impressed me. The fact more than half of all borrowers, each posses about 20 open credit lines seems counterintuitive in my opinion. Shouldn’t someone with these many open lines of credit be considered a liability?
ggduo(csv_file.drops, 1, 11)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$TotalTrades)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 22.00 23.33 30.00 75.00
ggduo(csv_file.drops, 1, 12)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$CurrentDelinquencies)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4841 0.0000 17.0000
ggduo(csv_file.drops, 1, 13)+
theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))
summary(csv_file.drops$TradesOpenedLast6Months)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.9359 1.0000 8.0000
Furthermore, we can better describe the graphed ranges of each:
csv_file.drops <- csv_file.reassesedLSandEmp[,c("LoanStatus",
"IncomeRange",
"IsBorrowerHomeowner",
"IncomeVerifiable",
"EmploymentStatus",
"FirstRecordedCreditLine",
"EmploymentStatusDuration",
"OpenCreditLines",
"DelinquenciesLast7Years",
"CreditScoreRange",
"CreditScoreRangeLower",
"CreditScoreRangeUpper",
"TotalCreditLinespast7years",
"TotalTrades",
"CurrentDelinquencies",
"TradesOpenedLast6Months",
"StatedMonthlyIncome",
"CurrentCreditLines",
"AmountDelinquent",
"DebtToIncomeRatio",
"LoanOriginalAmount")]
We finally, redefine csv_file.drops to use next some of the columns that we previously got rid off.
Here we will graph 3 varibles at once, controlling for employment status and loan status. We will also graph linear models.
ggplot(aes(x = DelinquenciesLast7Years,
y = OpenCreditLines,
color = LoanStatus ),
data = csv_file.drops)+
geom_point(alpha = 0.1, size =1)+
geom_smooth(method = "lm", se = FALSE,size = 0.5)+ #lm = linear model
scale_color_brewer(type='seq',
palette=7,
guide=guide_legend(title='LoanStatus'))
summary(csv_file.drops$OpenCreditLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 8.000 9.008 12.000 33.000
Unsurprisingly, according to the model, the greater the amount of delinquencies per borrower the lower the amount of Credit Lines these were granted, accross all faulty loan statuses.
ggplot(aes(x = DelinquenciesLast7Years,
y = StatedMonthlyIncome,
color =LoanStatus ),
data = csv_file.drops)+
geom_point(alpha = 0.1, size =1)+
geom_smooth(method = 'lm', se = FALSE, size = 0.5)+
scale_color_brewer(type = 'seq',
palette = 7,
guide=guide_legend(title='LoanStatus'))
summary(csv_file.drops$StatedMonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.25 3167.00 4583.00 5320.00 6585.00 33330.00
Here, there almost seems to be no relation among “StatedMonthIncome” and “DelinqueciesLast7Years” regardless of faulty loan status by employed borrowers.
ggplot(aes(x = StatedMonthlyIncome,
y = LoanOriginalAmount,
color = LoanStatus),
data = csv_file.drops)+
geom_point(alpha = 0.1, size =1)+
geom_smooth(method = 'lm', se = FALSE, size = 0.5)+
scale_color_brewer(type = 'seq',
palette = 7,
guide=guide_legend(title='LoanStatus'))
In this graphs we notice a few characteristics like the negative trend between open credit lines and delinquencies in the last 7 years among borrowers, with all unfavorable levels of loan status, who are employed. Also, there is a positive trend between stated monthly income and loan original amount, with all unfavorable levels of loan status, for borrowers who are employed.
Let’s take a look at some interaction with a correlation value of higher than 0.3 (Significant):
“AmountDelinquent” and “TotalCreditLinespast7years” | cor 0.41
“StatedMonthlyIncome” and “LoanOriginalAmount” | cor 0.335
“LoanOriginalAmount” vs “CreditScoreRangeUpper(Lower)” | cor 0.323
The correlation between amount delinquent and total credit lines opened in the past 7 years may be so relatively highly correlated since it makes sense that the greater the net-credit (sum of all the given credit in every line) a borrower possess, the more room there is for a borrower to owe, even past a delinquent status.
Although the correlation between “LoanOriginalAmount” vs “CreditScoreRangeUpper(Lower)” and “StatedMonthlyIncome” vs “LoanOriginalAmount” are considered significant, They are not as highly correlated as I expected them to be. At first, under my logic, I suspected that these respectively compared variables would have a correlation of close to 1 since it made sense that the more money you make, the more you can ask for in a loan; By contrast, the lower a borrower’s lowest credit score range, the least likely they are to get more money in their loan. But then I realized that I was not taking the other variables into account at the moment, and that I had over estimated the impact of these two variables on “LoanOriginalAmount”.
I believe that this is the best plot from our univariable plot section. Since our main variable of interest is the our “LoanStaus” variable, we can use this plot to clealy depict the magnitude of each of it’s levels. We can see that although our unfavorable loan status are relatively uncommon, such loans still exist. From here we can better distinguish the rest of our dataset to learn from these individuals with unfavorable loan statuses.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Here, just like in the last plot we are showcasing the magnitudes of each possible loan status, but we are using the fill aes argument to visually separate the different employment statuses that any borrower could have noted at the time of listing. This addition serves to demonstrate the composition of each of this loan statuses in terms of employment statuses. we can also use the coord_cartesian wrap to zoom in onto a single loan status level at a time (or more), and the scale_y_continous wrap to zoom in vertically_wise.
I was able to singled out each column in this graph, which helped in getting a sense of percetage composition of each one of these columns by employment status. Later on I was able to futher filter my data to investigate that employment status which seem to make part of those Loan statuses of interest.
Here we visualize all trends of each level of loan status against the amount of loan original amount vs the delinquencies in the last 7 years of each borrower who is employed. I believe it is important to highlight these relations since they demonstrate the increasing tendency in loan original amounts according to monthly income for all unfavorable loan status, but at different rates.
It was an interesting project to work on. It served me in helping me build a strategy with which to approach a data set for explorational purposes. Better than that is the fact that I got the chance to do so with a dataset of the financial sector, which peaks my interest.
Some of the struggles I faced at first were finding a purpose for any information I could extract, and selecting my variables carefully, since this is a large set of data. After that, I struggled with few bugs in Rstudio with the graph image resizing (which ultimately led me to discover the “theme” wrap for the functions qplot an ggplot, and jpeg image saving, although I did not use this one in this project). I also found another way to apply a series of conditional filters to a whole dataset, which then led me to discover the “which”" argument in the multivariable plot section.
The biggest surprise to me was the plotted the composition of loans per loan status, in terms of employment status in the bivariate plot section. It is clearly visible that those unfavorable status loans were mostly made up of employed borrowers. Although a “past due” does not necessarily imply that a loan will be defaulted on, it does raise question as to why those who are employ tend be the majority of the crowd of those who miss payments, and subsequently may be predisposed to default on their loans. Another surprise was finding how similar most borrower with unfavorable loan statuses are to the rest (I found this with code that can be found in my brainstorm section below) Although not exactly similar to the rest of the borrowers, our borrowers in the csv_file.drops data set, share a lot of characteristics with their counter parts for example: their high and low credit score ranges had a similiar variation, their employment status durations were very similar and so was the spread of original loan amounts. There are even more statistical comparison that can be seen below.
Obviously, there is a lot more relationships that are hidden behind this data, and even more if one is to look at the entire original dataset. Some of the other variables, and impact on loan status, I would have liked to investigate are “BorrowerAPR” and “BorrowerRate”; these two could potentially be some of the biggest factors in predicting whether a loan will be defaulted on or not.
Everything below this point was part of all roughs drafts I previously made. I kept all of this here in case it would come in handy later on, not just the code but the ideas as well.
ggplot(aes(x = BorrowerState, y = LoanOriginalAmount), data =
csv_file.reassesedLSandEmp)+
geom_boxplot()+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
ggplot(aes(x = Occupation, y = LoanOriginalAmount), data =
csv_file.reassesedLSandEmp)+
geom_boxplot()+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
## CreditScoreRangeLower vs BorrowerStates
ggplot(aes(x = BorrowerState, y = CreditScoreRangeLower), data =
csv_file.reassesedLSandEmp)+
geom_boxplot()+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
## CreditScoreRangeLower vs Occupation
ggplot(aes(x = Occupation, y = CreditScoreRangeLower), data =
csv_file.reassesedLSandEmp)+
geom_boxplot()+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
It is easy now to spot the variability of The lowest reported credit scores for all employed customers who have unfavorable loan statuses by their state of residence or by their profession. The ‘CreditScoreRangeLower’ vs ‘BorrowerState’ reveals that Idaho, Oklahoma and Wisconsin are some of the states with the largest variability. Alaska possess the lowest variability. Finally, Wyoming presents itself with the same variability of other states but at an overall higher range than others. In other words, Wyomings’ borrower’s credit scores tend to be higher than most others.
The “Occupation” vs “CreditScoreRangeLower” demonstrates that among those employed borrowers with unfavorable loan statuses, borrowers who are mdeical technicians have some of the highest spreads of Lower range credit scores. Whereas chemists, judges, borrower’s who work in the landscaping business and college freshmen/junior/Sophomore are amongst those with the lowest spread.
## CreditScoreRangeUpper vs BorrowerStates
ggplot(aes(x = BorrowerState, y = CreditScoreRangeUpper), data =
csv_file.reassesedLSandEmp)+
geom_boxplot()+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
## CreditScoreRangeUpper vs Occupation
ggplot(aes(x = Occupation, y = CreditScoreRangeUpper), data =
csv_file.reassesedLSandEmp)+
geom_boxplot()+
theme(axis.text.x=element_text(angle=90,vjust=1,hjust=1))
Just like we saw before WY has some of the best credit scores, in this case, the upper range. Like we noticed before ID, OK and WI have some of the greatest spread, but unlike before we notice that CT also possess a higher than normal spread, in the upper range of course.
At this point we have seen some of the few combination of interaction among a few variables of the latest subset of our original dataset.
csv_file.counterDrops will be a new subset of our orginal data set in which you can find all the rows that are NOT in the csv_file.drops subset, but it will only contain the same columns as does csv_file.drops, this will help compare our specific subjects of interest againt the larger majority that is not employed AND possess an unfavorable loan status.
# make csv_file.counterDrops
# we are going to write csv_file2, the file against which we will compare
# csv_file.drops in order to find those rows the belong in
# csv_file.counterdrops, but we will do so, in the same column order as
# csv_file.drops
csv_file.counterDrops <- csv_file[,c("LoanStatus",
"EmploymentStatus",
"FirstRecordedCreditLine",
"EmploymentStatusDuration",
"OpenCreditLines",
"DelinquenciesLast7Years",
"CreditScoreRange",
"CreditScoreRangeLower",
"CreditScoreRangeUpper",
"TotalCreditLinespast7years",
"TotalTrades",
"CurrentDelinquencies",
"TradesOpenedLast6Months",
"StatedMonthlyIncome",
"CurrentCreditLines",
"AmountDelinquent",
"DebtToIncomeRatio",
"LoanOriginalAmount")]
csv_file.drops <- csv_file.drops[,c("LoanStatus",
"EmploymentStatus",
"FirstRecordedCreditLine",
"EmploymentStatusDuration",
"OpenCreditLines",
"DelinquenciesLast7Years",
"CreditScoreRange",
"CreditScoreRangeLower",
"CreditScoreRangeUpper",
"TotalCreditLinespast7years",
"TotalTrades",
"CurrentDelinquencies",
"TradesOpenedLast6Months",
"StatedMonthlyIncome",
"CurrentCreditLines",
"AmountDelinquent",
"DebtToIncomeRatio",
"LoanOriginalAmount")]
csv_file.counterDrops <-
csv_file.counterDrops[which(
csv_file.counterDrops$EmploymentStatus != "Employed"&(
csv_file.counterDrops$LoanStatus !='Past Due (1-15 days)'|
csv_file.counterDrops$LoanStatus !='Past Due (16-30 days)'|
csv_file.counterDrops$LoanStatus !='Past Due (31-60 days)'|
csv_file.counterDrops$LoanStatus !='Past Due (61-90 days)'|
csv_file.counterDrops$LoanStatus !='Past Due (91-120 days)'|
csv_file.counterDrops$LoanStatus !='Past Due (>120 days)'|
csv_file.counterDrops$LoanStatus !='Defaulted')),]
unique(csv_file.counterDrops$EmploymentStatus)
## [1] Self-employed Not available Full-time Other
## [6] Not employed Part-time Retired
## 9 Levels: Employed Full-time Not available Not employed ... Self-employed
unique(csv_file.counterDrops$LoanStatus)
## [1] Completed Defaulted Current
## [4] Chargedoff Past Due (1-15 days) Cancelled
## [7] Past Due (16-30 days) Past Due (31-60 days) FinalPaymentInProgress
## [10] Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 12 Levels: Cancelled Chargedoff Completed Current ... Past Due (91-120 days)
names(csv_file.counterDrops)
## [1] "LoanStatus" "EmploymentStatus"
## [3] "FirstRecordedCreditLine" "EmploymentStatusDuration"
## [5] "OpenCreditLines" "DelinquenciesLast7Years"
## [7] "CreditScoreRange" "CreditScoreRangeLower"
## [9] "CreditScoreRangeUpper" "TotalCreditLinespast7years"
## [11] "TotalTrades" "CurrentDelinquencies"
## [13] "TradesOpenedLast6Months" "StatedMonthlyIncome"
## [15] "CurrentCreditLines" "AmountDelinquent"
## [17] "DebtToIncomeRatio" "LoanOriginalAmount"
summary(csv_file.counterDrops)
## LoanStatus EmploymentStatus
## Completed :25742 Full-time :26355
## Chargedoff : 8463 Self-employed: 6134
## Current : 7647 Not available: 5347
## Defaulted : 4388 Other : 3806
## Past Due (1-15 days) : 131 : 2255
## Past Due (31-60 days): 70 Part-time : 1088
## (Other) : 174 (Other) : 1630
## FirstRecordedCreditLine EmploymentStatusDuration OpenCreditLines
## Min. :1947-08-24 Min. : 0.00 Min. : 0.000
## 1st Qu.:1989-11-01 1st Qu.: 19.00 1st Qu.: 5.000
## Median :1995-03-27 Median : 51.00 Median : 8.000
## Mean :1994-02-21 Mean : 83.29 Mean : 8.555
## 3rd Qu.:1999-08-19 3rd Qu.:114.00 3rd Qu.:11.000
## Max. :2012-02-14 Max. :755.00 Max. :51.000
## NA's :697 NA's :7624 NA's :7604
## DelinquenciesLast7Years CreditScoreRange CreditScoreRangeLower
## Min. : 0.000 Min. :19 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.:19 1st Qu.:620.0
## Median : 0.000 Median :19 Median :660.0
## Mean : 4.784 Mean :19 Mean :665.4
## 3rd Qu.: 4.000 3rd Qu.:19 3rd Qu.:720.0
## Max. :99.000 Max. :19 Max. :880.0
## NA's :990 NA's :591 NA's :591
## CreditScoreRangeUpper TotalCreditLinespast7years TotalTrades
## Min. : 19.0 Min. : 2.0 Min. : 0.00
## 1st Qu.:639.0 1st Qu.: 15.0 1st Qu.: 13.00
## Median :679.0 Median : 23.0 Median : 20.00
## Mean :684.4 Mean : 25.1 Mean : 21.71
## 3rd Qu.:739.0 3rd Qu.: 33.0 3rd Qu.: 29.00
## Max. :899.0 Max. :136.0 Max. :126.00
## NA's :591 NA's :697 NA's :7544
## CurrentDelinquencies TradesOpenedLast6Months StatedMonthlyIncome
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 2550
## Median : 0.0000 Median : 0.000 Median : 4000
## Mean : 0.9981 Mean : 0.876 Mean : 4842
## 3rd Qu.: 1.0000 3rd Qu.: 1.000 3rd Qu.: 5833
## Max. :83.0000 Max. :17.000 Max. :1750003
## NA's :697 NA's :7544
## CurrentCreditLines AmountDelinquent DebtToIncomeRatio LoanOriginalAmount
## Min. : 0.000 Min. : 0 Min. : 0.000 Min. : 1000
## 1st Qu.: 6.000 1st Qu.: 0 1st Qu.: 0.130 1st Qu.: 2550
## Median : 9.000 Median : 0 Median : 0.210 Median : 4500
## Mean : 9.782 Mean : 1026 Mean : 0.313 Mean : 6233
## 3rd Qu.:13.000 3rd Qu.: 0 3rd Qu.: 0.320 3rd Qu.: 8000
## Max. :52.000 Max. :444745 Max. :10.010 Max. :35000
## NA's :7604 NA's :7622 NA's :7128
summary(csv_file.drops)
## LoanStatus EmploymentStatus
## Past Due (1-15 days) :675 Employed :2355
## Defaulted :630 : 0
## Past Due (31-60 days) :293 Full-time : 0
## Past Due (61-90 days) :262 Not available: 0
## Past Due (91-120 days):255 Not employed : 0
## Past Due (16-30 days) :226 Other : 0
## (Other) : 14 (Other) : 0
## FirstRecordedCreditLine EmploymentStatusDuration OpenCreditLines
## Min. :1958-01-01 Min. : 0.00 Min. : 0.000
## 1st Qu.:1991-04-27 1st Qu.: 28.00 1st Qu.: 6.000
## Median :1996-08-06 Median : 69.00 Median : 8.000
## Mean :1995-05-29 Mean : 94.99 Mean : 9.008
## 3rd Qu.:2000-05-24 3rd Qu.:135.50 3rd Qu.:12.000
## Max. :2011-04-19 Max. :527.00 Max. :33.000
##
## DelinquenciesLast7Years CreditScoreRange CreditScoreRangeLower
## Min. : 0.000 Min. :19 Min. :600.0
## 1st Qu.: 0.000 1st Qu.:19 1st Qu.:660.0
## Median : 0.000 Median :19 Median :680.0
## Mean : 4.409 Mean :19 Mean :689.8
## 3rd Qu.: 3.000 3rd Qu.:19 3rd Qu.:720.0
## Max. :89.000 Max. :19 Max. :860.0
##
## CreditScoreRangeUpper TotalCreditLinespast7years TotalTrades
## Min. :619.0 Min. : 2.00 Min. : 1.00
## 1st Qu.:679.0 1st Qu.:18.00 1st Qu.:15.00
## Median :699.0 Median :26.00 Median :22.00
## Mean :708.8 Mean :27.72 Mean :23.33
## 3rd Qu.:739.0 3rd Qu.:35.00 3rd Qu.:30.00
## Max. :879.0 Max. :97.00 Max. :75.00
##
## CurrentDelinquencies TradesOpenedLast6Months StatedMonthlyIncome
## Min. : 0.0000 Min. :0.0000 Min. : 0.25
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 3166.67
## Median : 0.0000 Median :1.0000 Median : 4583.33
## Mean : 0.4841 Mean :0.9359 Mean : 5320.30
## 3rd Qu.: 0.0000 3rd Qu.:1.0000 3rd Qu.: 6585.00
## Max. :17.0000 Max. :8.0000 Max. :33333.33
##
## CurrentCreditLines AmountDelinquent DebtToIncomeRatio LoanOriginalAmount
## Min. : 0.0 Min. : 0 Min. : 0.0100 Min. : 1000
## 1st Qu.: 6.0 1st Qu.: 0 1st Qu.: 0.1600 1st Qu.: 4000
## Median : 9.0 Median : 0 Median : 0.2300 Median : 6000
## Mean :10.1 Mean : 1417 Mean : 0.3009 Mean : 8051
## 3rd Qu.:13.0 3rd Qu.: 0 3rd Qu.: 0.3500 3rd Qu.:10000
## Max. :34.0 Max. :183396 Max. :10.0100 Max. :35000
## NA's :85
The point of this last section was to compare our subjects of interest to their counterparts in a more detailed aspect. More information can be derived from these summaries alone.